Dissecting PHP's trim Function

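This week I ran into a problem with the trim function. The business needed to compute the number of buyers and display it in a special way: counts above 10000 are shown as xx.x万 人 (units of ten thousand), and counts below 10000 are shown as the raw number. The product team had one more requirement: if the computed value comes out as exactly xx.0, the .0 must be dropped. So the code at the time was $buyNum = trim(round($buyNum / 10000, 1), '.0');. For a value such as 16.0 this works, but for 10.0 it breaks: because of how trim behaves, the final result is 1, which is far from what we expected. So what is really going on? Let's investigate step by step and understand it thoroughly, so that we don't fall into the same trap again. The PHP source code below is from PHP 7.1.6.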

Let's start with how the official PHP documentation describes trim:

  • trim function prototype:

    trim ( string $str [, string $character_mask = " \t\n\r\0\x0B" ] ) : string
  • trim function description:

    This function returns a string with whitespace stripped from the beginning and end of str. Without the second parameter, trim() will strip these characters:

    • " " (ASCII 32 (0x20)), an ordinary space.
    • "\t" (ASCII 9 (0x09)), a tab.
    • "\n" (ASCII 10 (0x0A)), a new line (line feed).
    • "\r" (ASCII 13 (0x0D)), a carriage return.
    • "\0" (ASCII 0 (0x00)), the NUL-byte.
    • "\x0B" (ASCII 11 (0x0B)), a vertical tab.
  • trim parameters:
    • str
      The string that will be trimmed.
    • character_mask
      Optionally, the stripped characters can also be specified using the character_mask parameter. Simply list all characters that you want to be stripped. With .. you can specify a range of characters.

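The official documentation covers only the basic usage. Without a second argument, trim strips the following characters by default: space (' '), horizontal tab (\t), line feed (\n), carriage return (\r), NUL byte (\0), and vertical tab (\x0B, also written \v). With a second argument, it strips the characters you specify instead, and .. denotes a range. Going by that description, what should the following two examples produce?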

<?php
$str = 'Hello World';
$a = trim($str, 'Hdle');
$b = trim($str, 'HdWr');
var_dump($a);
var_dump($b);

Reading the description as "remove the listed characters", one might expect:

string(5) "o Wor"
string(7) "ello ol"

And what does it actually print?

string(5) "o Wor"
string(9) "ello Worl"

The second result is quite different from what we expected. What is going on?
Back to my buyer-count problem:

<?php
$buyNum = 99907;
$buyNum = trim(round($buyNum / 10000, 1), '.0');
var_dump($buyNum);
//string(1) "1"

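This is a bit strange, even confusing. What does it mean? The problem can of course be solved without trim, but why does trim fail here, or rather produce the wrong result? Read literally, the official documentation does imply a result of 1, yet it offers no further explanation. Do we stop there? No! We can read the trim source code, get to the bottom of it, and truly understand how it is implemented. Then we won't make the same mistake again, and when someone says that trim is awkward and full of pitfalls, we will know why the pitfall exists and where it comes from.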
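For reference, here is one way the display rule could be handled without trim. This is only a minimal sketch: formatBuyNum is a hypothetical helper name, and treating a count of exactly 10000 as "1万" is my assumption, since the rule above only covers counts above and below 10000. The idea is to remove the literal suffix ".0" rather than a set of characters.

```php
<?php
// Hypothetical helper (not from the original code): formats a buyer count
// as "xx.x万", dropping a trailing ".0".
function formatBuyNum(int $buyNum): string
{
    if ($buyNum < 10000) {
        return (string) $buyNum; // small counts are shown as-is
    }
    // number_format always yields exactly one decimal place, e.g. "10.0".
    $text = number_format($buyNum / 10000, 1, '.', '');
    // Strip the literal suffix ".0" only, not the character set {'.', '0'}.
    if (substr($text, -2) === '.0') {
        $text = substr($text, 0, -2);
    }
    return $text . '万';
}

echo formatBuyNum(99907), "\n";  // 10万
echo formatBuyNum(123456), "\n"; // 12.3万
echo formatBuyNum(9999), "\n";   // 9999
```

Because substr compares the whole two-character suffix, "10.0" becomes "10" and the zero in front of the decimal point is left alone.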

trim is implemented by the php_trim function in php-7.1.6/ext/standard/string.c. The core code is:

/* {{{ php_trim()
 * mode 1 : trim left
 * mode 2 : trim right
 * mode 3 : trim left and right
 * what indicates which chars are to be trimmed. NULL->default (' \t\n\r\v\0')
 */
PHPAPI zend_string *php_trim(zend_string *str, char *what, size_t what_len, int mode)
{
    const char *c = ZSTR_VAL(str);
    size_t len = ZSTR_LEN(str);
    register size_t i;
    size_t trimmed = 0;
    char mask[256];

    if (what) {
        if (what_len == 1) {
            char p = *what;
            if (mode & 1) {
                for (i = 0; i < len; i++) {
                    if (c[i] == p) {
                        trimmed++;
                    } else {
                        break;
                    }
                }
                len -= trimmed;
                c += trimmed;
            }
            if (mode & 2) {
                if (len > 0) {
                    i = len - 1;
                    do {
                        if (c[i] == p) {
                            len--;
                        } else {
                            break;
                        }
                    } while (i-- != 0);
                }
            }
        } else {
            php_charmask((unsigned char*)what, what_len, mask);

            if (mode & 1) {
                for (i = 0; i < len; i++) {
                    if (mask[(unsigned char)c[i]]) {
                        trimmed++;
                    } else {
                        break;
                    }
                }
                len -= trimmed;
                c += trimmed;
            }
            if (mode & 2) {
                if (len > 0) {
                    i = len - 1;
                    do {
                        if (mask[(unsigned char)c[i]]) {
                            len--;
                        } else {
                            break;
                        }
                    } while (i-- != 0);
                }
            }
        }
    } else {
        if (mode & 1) {
            for (i = 0; i < len; i++) {
                if ((unsigned char)c[i] <= ' ' &&
                    (c[i] == ' ' || c[i] == '\n' || c[i] == '\r' || c[i] == '\t' || c[i] == '\v' || c[i] == '\0')) {
                    trimmed++;
                } else {
                    break;
                }
            }
            len -= trimmed;
            c += trimmed;
        }
        if (mode & 2) {
            if (len > 0) {
                i = len - 1;
                do {
                    if ((unsigned char)c[i] <= ' ' &&
                        (c[i] == ' ' || c[i] == '\n' || c[i] == '\r' || c[i] == '\t' || c[i] == '\v' || c[i] == '\0')) {
                        len--;
                    } else {
                        break;
                    }
                } while (i-- != 0);
            }
        }
    }

    if (ZSTR_LEN(str) == len) {
        return zend_string_copy(str);
    } else {
        return zend_string_init(c, len, 0);
    }
}

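The comment above the function, together with its signature, explains the parameters:

  • str: the original string
  • what: the set of characters to strip
  • what_len: the length of what
  • mode: which side to trim: 1 for left, 2 for right, 3 for both

How trim processes its input:

  1. Check whether what is set; if it is not, strip the default characters (' \t\n\r\v\0').
  2. Check the length of what: a single character and multiple characters take separate code paths.
  3. Use bitwise AND on mode (mode & 1 and mode & 2) to decide whether to trim the left side, the right side, or both.
  4. With multiple characters, trimming loops inward from each end and stops at the first character that is not in the list.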
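The steps above can be sketched in userland PHP. This is only an illustration of the multi-character path, not the real implementation: trimSketch is a hypothetical name, it always trims both sides (mode 3), and it deliberately ignores .. ranges.

```php
<?php
// Illustrative re-implementation of the multi-character branch of php_trim.
// trimSketch is a hypothetical name; ".." ranges are not supported here.
function trimSketch(string $str, string $what): string
{
    // Build a 256-entry lookup table: one flag per possible byte value.
    $mask = array_fill(0, 256, false);
    for ($k = 0, $n = strlen($what); $k < $n; $k++) {
        $mask[ord($what[$k])] = true;
    }

    $start = 0;
    $end = strlen($str);
    // Left side (mode & 1): skip leading bytes that are in the mask.
    while ($start < $end && $mask[ord($str[$start])]) {
        $start++;
    }
    // Right side (mode & 2): drop trailing bytes that are in the mask.
    while ($end > $start && $mask[ord($str[$end - 1])]) {
        $end--;
    }
    return substr($str, $start, $end - $start);
}

echo trimSketch('10.0', '.0'), "\n";          // 1
echo trimSketch('16.0', '.0'), "\n";          // 16
echo trimSketch('Hello World', 'HdWr'), "\n"; // ello Worl
```

For '10.0' the right-hand loop removes '0', '.', and then the second '0', because all three bytes are in the mask, and it only stops at '1'. For '16.0' it stops at '6', which is why that case happened to look correct.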

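Single-character trimming is unambiguous, so we focus on the multi-character case, which is where the confusion arises. Its key step is the php_charmask function, defined as follows: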

/* {{{ php_charmask
 * Fills a 256-byte bytemask with input. You can specify a range like 'a..z',
 * it needs to be incrementing.
 * Returns: FAILURE/SUCCESS whether the input was correct (i.e. no range errors)
 */
static inline int php_charmask(unsigned char *input, size_t len, char *mask)
{
    unsigned char *end;
    unsigned char c;
    int result = SUCCESS;

    memset(mask, 0, 256);
    for (end = input+len; input < end; input++) {
        c=*input;
        if ((input+3 < end) && input[1] == '.' && input[2] == '.'
                && input[3] >= c) {
            memset(mask+c, 1, input[3] - c + 1);
            input+=3;
        } else if ((input+1 < end) && input[0] == '.' && input[1] == '.') {
            /* Error, try to be as helpful as possible:
               (a range ending/starting with '.' won't be captured here) */
            if (end-len >= input) { /* there was no 'left' char */
                php_error_docref(NULL, E_WARNING, "Invalid '..'-range, no character to the left of '..'");
                result = FAILURE;
                continue;
            }
            if (input+2 >= end) { /* there is no 'right' char */
                php_error_docref(NULL, E_WARNING, "Invalid '..'-range, no character to the right of '..'");
                result = FAILURE;
                continue;
            }
            if (input[-1] > input[2]) { /* wrong order */
                php_error_docref(NULL, E_WARNING, "Invalid '..'-range, '..'-range needs to be incrementing");
                result = FAILURE;
                continue;
            }
            /* FIXME: better error (a..b..c is the only left possibility?) */
            php_error_docref(NULL, E_WARNING, "Invalid '..'-range");
            result = FAILURE;
            continue;
        } else {
            mask[c]=1;
        }
    }
    return result;
}

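php_charmask uses a 256-byte mask array to mark the bytes that should be stripped. Trimming then proceeds just as in the single-character case, except that each loop stops at the first byte that is not in the mask. We can also see how ranges are handled, that is, the .. in trim's second argument. The code also shows that a malformed range in the second argument, such as three dots (...), makes the function emit an E_WARNING.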
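A few calls make the mask behaviour concrete. This is a small sketch using only the built-in trim; the input strings are made up for illustration.

```php
<?php
// 'a..z' is expanded by php_charmask into every byte from 'a' through 'z'.
echo trim('abc123xyz', 'a..z'), "\n";      // 123

// '0..9' strips digits from both ends and stops at the first non-digit.
echo trim('2024-report-01', '0..9'), "\n"; // -report-

// Without '..', each byte is marked individually: here '.' and '0'.
echo trim('10.0', '.0'), "\n";             // 1
```

In every case the second argument is a set of bytes, never a substring to match, which is exactly what the mask array encodes.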
Now that we understand how trim works internally, let's trace the function with GDB:

 gdb php
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-114.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/local/php71/bin/php...done.
(gdb) b php_charmask
Breakpoint 1 at 0x730705: php_charmask. (4 locations)
(gdb) r ~/Code/PHP/trim.php
Starting program: /bin/php ~/Code/PHP/trim.php
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".

Breakpoint 1, php_trim (str=0x7ffff1c03560, what=0x7ffff1c03598 ".0",
what_len=2, mode=3) at /usr/src/php-7.1.6/ext/standard/string.c:829
829 php_charmask((unsigned char*)what, what_len, mask);
(gdb) s
php_charmask (mask=0x7fffffffaa30 "", len=2, input=0x7ffff1c03598 ".0")
at /usr/src/php-7.1.6/ext/standard/string.c:751
751 memset(mask, 0, 256);
(gdb) n
752 for (end = input+len; input < end; input++) {
(gdb) p input
$1 = (unsigned char *) 0x7ffff1c03598 ".0"
(gdb) p *input
$2 = 46 '.'
(gdb) n
751 memset(mask, 0, 256);
(gdb)
752 for (end = input+len; input < end; input++) {
(gdb)
751 memset(mask, 0, 256);
(gdb)
752 for (end = input+len; input < end; input++) {
(gdb)
761 if (end-len >= input) { /* there was no 'left' char */
(gdb)
php_trim (str=0x7ffff1c03560, what=<optimized out>, what_len=<optimized out>,
mode=<optimized out>) at /usr/src/php-7.1.6/ext/standard/string.c:829
829 php_charmask((unsigned char*)what, what_len, mask);
(gdb) s
php_charmask (mask=0x7fffffffaa30 "", len=<optimized out>,
input=0x7ffff1c03598 ".0") at /usr/src/php-7.1.6/ext/standard/string.c:754
754 if ((input+3 < end) && input[1] == '.' && input[2] == '.'
(gdb)
753 c=*input;
(gdb) p c
$3 = <optimized out>
(gdb) p *input
$4 = 46 '.'
(gdb) n
754 if ((input+3 < end) && input[1] == '.' && input[2] == '.'
(gdb) n
758 } else if ((input+1 < end) && input[0] == '.' && input[1] == '.') {
(gdb) n
797 size_t len = ZSTR_LEN(str);
(gdb) p str
$5 = (zend_string *) 0x7ffff1c03560
(gdb) p *str
$6 = {gc = {refcount = 0, u = {v = {type = 6 '\006', flags = 2 '\002',
gc_info = 0}, type_info = 518}}, h = 9223372043238031460, len = 4,
val = "1"}
(gdb) p len
$7 = 4
(gdb) n
829 php_charmask((unsigned char*)what, what_len, mask);
(gdb) s
php_charmask (mask=0x7fffffffaa30 "", len=<optimized out>,
input=0x7ffff1c03599 "0") at /usr/src/php-7.1.6/ext/standard/string.c:754
754 if ((input+3 < end) && input[1] == '.' && input[2] == '.'
(gdb)
753 c=*input;
(gdb)
754 if ((input+3 < end) && input[1] == '.' && input[2] == '.'
(gdb)
758 } else if ((input+1 < end) && input[0] == '.' && input[1] == '.') {
(gdb)
781 mask[c]=1;

(gdb) p c
$8 = 48 '0'
(gdb) n
752 for (end = input+len; input < end; input++) {
(gdb)
php_trim (str=0x7ffff1c03560, what=<optimized out>, what_len=<optimized out>,
mode=<optimized out>) at /usr/src/php-7.1.6/ext/standard/string.c:831
831 if (mode & 1) {
(gdb)
797 size_t len = ZSTR_LEN(str);
(gdb) n
831 if (mode & 1) {
(gdb) n
832 for (i = 0; i < len; i++) {
(gdb) n
833 if (mask[(unsigned char)c[i]]) {
(gdb) n
839 len -= trimmed;
(gdb) n
840 c += trimmed;
(gdb) n
839 len -= trimmed;
(gdb) n
842 if (mode & 2) {
(gdb) n
843 if (len > 0) {
(gdb) n
846 if (mask[(unsigned char)c[i]]) {
(gdb) p c
$10 = 0x7ffff1c03578 "10.0"
(gdb) p c[i]
$11 = 48 '0'
(gdb) p i
$12 = 3
(gdb) n
851 } while (i-- != 0);
(gdb) n
846 if (mask[(unsigned char)c[i]]) {
(gdb) p c[i]
$13 = 46 '.'
(gdb) p i
$14 = 2
(gdb) n
851 } while (i-- != 0);
(gdb) n
846 if (mask[(unsigned char)c[i]]) {
(gdb) p c[i]
$15 = 48 '0'
(gdb) n
851 } while (i-- != 0);
(gdb) n
846 if (mask[(unsigned char)c[i]]) {
(gdb) p c[i]
$16 = 49 '1'
(gdb) p i
$17 = 0
(gdb) n
883 if (ZSTR_LEN(str) == len) {
(gdb) p len
$18 = 1
(gdb) n
886 return zend_string_init(c, len, 0);
(gdb) p c
$19 = 0x7ffff1c03578 "10.0"
(gdb)
$20 = 0x7ffff1c03578 "10.0"
(gdb) p *c
$21 = 49 '1'

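This confirms our conclusion and deepens our understanding of trim. One extension: because trim strips byte by byte, trimming Chinese text can produce garbled output. In UTF-8 a Chinese character typically occupies 3 bytes, so trim may remove only some of the bytes of a character and leave an invalid sequence behind. Knowing how a function is implemented, down to the details, helps avoid many pitfalls.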

>>> trim('品、', '、')
=> b"å“"
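The garbled result above comes from shared bytes: in UTF-8, 、 is E3 80 81 and 品 is E5 93 81, so after removing the three bytes of 、 the right-hand loop also removes the trailing 81 of 品. One way around this is to trim whole characters with a UTF-8 aware regular expression. The following is a minimal sketch; mbTrimChars is a hypothetical helper name, and it assumes the input is valid UTF-8.

```php
<?php
// Hypothetical helper: trims whole UTF-8 characters instead of single bytes.
function mbTrimChars(string $str, string $chars): string
{
    if ($chars === '') {
        return $str; // nothing to strip; also avoids an empty character class
    }
    // preg_quote escapes regex metacharacters; the /u modifier makes PCRE
    // treat both pattern and subject as UTF-8 code points.
    $class = '[' . preg_quote($chars, '/') . ']+';
    $result = preg_replace('/\A' . $class . '|' . $class . '\z/u', '', $str);
    // preg_replace returns null on failure (e.g. invalid UTF-8 input).
    return $result ?? $str;
}

echo mbTrimChars('品、', '、'), "\n";   // 品
echo mbTrimChars('、、品、', '、'), "\n"; // 品
```

Unlike trim, the character class here matches complete code points, so 品 is left intact even though it shares a byte with 、.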